Abstract

We introduce a nonlinear aggregation type classifier for functional data 
defined on a separable and complete metric space. The new rule is built 
up from a collection of M arbitrary training classifiers. If the classifiers 
are consistent, then so is the aggregation rule. Moreover, asymptotically 
the aggregation rule behaves as well as the best of the M classifiers. The 
results of a small simulation study are reported for both high dimensional and
functional data, and a real data example is analyzed.
1 Introduction

Supervised classification is still one of the hot topics for high dimensional and
functional data, due to the importance of its applications and its intrinsic
difficulty in a general setup. In this context, there is a vast literature on
classification methods, which include linear classification, k-nearest neighbors
and kernel rules, and classification based on partial least squares, reproducing
kernels or depth measures. Complete surveys of the literature are the works by
Baíllo et al. [1], Cuevas, and Delaigle and Hall. In the book Contributions
in infinite-dimensional statistics and related topics [7], there are also several
recent advances in supervised and unsupervised classification. See, for instance,
Chapters 2, 5, 22 or 48, or directly Chapter 1 of this issue (Bongiorno et al.).
In this context, there has very recently been great interest in developing
aggregation methods. In particular, there is a large list of linear aggregation
methods like boosting (Breiman [8]; Breiman) and random forest (Breiman [10];
Biau et al. [3]; Biau), among others. All these methods exhibit an important
improvement when combining a subset of classifiers to produce a new one. Most of
the contributions to the aggregation literature have been proposed for
nonparametric regression, a problem closely related to classification, since
classification rules can be obtained just by plugging the estimate of the
regression function into the Bayes rule (see, for instance, Yang and Bunea et
al.). Model selection (selecting the optimal single model from a list of models),
convex aggregation (searching for the optimal convex combination of a given set
of estimators), and linear aggregation (selecting the optimal linear combination
of estimators) are important contributions among a large list.

In the finite dimensional setup, Mojirsheibani [17] and [18] introduced a
combined classifier, showing strong consistency under somewhat hard-to-verify
assumptions involving the Vapnik-Chervonenkis dimension of the random partitions
of the set of classifiers, which are not valid in the functional setup. Very
recently, Biau et al. introduced a new nonlinear aggregation strategy for the
regression problem called COBRA, extending the ideas in Mojirsheibani [17] to
the more general setup of nonparametric regression in $\mathbb{R}^d$. In the same
direction, but for the classification problem in the infinite dimensional setup,
we extend the ideas in Mojirsheibani [17] to construct a classification rule
which combines, in a nonlinear way, several classifiers to construct an optimal
one. We point out that our rule allows combining methods of very different
nature, taking advantage of the abilities of each expert and making it possible
to adapt the method to different classes of datasets. Even though our classifier
allows aggregating experts of the same nature, the possibility of combining
classifiers of different character improves the use of existing rules such as the
bagged nearest neighbors classifier (see, for instance, Hall and Samworth). As in
Biau et al. [3], we also introduce a more flexible form of the rule which
discards a small percentage $\alpha$ of those preliminary experts that behave
differently from the rest. Under very mild assumptions, we prove consistency,
obtain rates of convergence and show some optimality properties of the
aggregated rule. To build up this classifier, we use the inverse function (see
also Fraiman et al. [13]) of each preliminary expert, which makes the proposal
particularly well designed for high dimensional data, avoiding the curse of
dimensionality. It also performs well in functional data settings.

In Section 2 we introduce the new classifier in the general context of a
separable and complete metric space, which combines, in a nonlinear way, the
decisions of M experts (classifiers). A more flexible rule is also considered. In
Section 3 we state our two main results regarding consistency, rates of
convergence and asymptotic optimality of the classifier. Asymptotically, the new
rule performs as well as the best of the M classifiers used to build it up.
Section 4 is devoted to showing, through some simulations, the performance of the
new classifier on high dimensional and functional data for moderate sample
sizes. A real data example is also considered. All proofs are given in the
Appendix.


2 The setup 

Throughout the manuscript $\mathcal{F}$ will denote a separable and complete
metric space, $(X,Y)$ a random pair taking values in $\mathcal{F} \times \{0,1\}$
and $\mu$ the probability measure of $X$. The elements of the training sample
$\mathcal{D}_n = \{(X_1,Y_1),\dots,(X_n,Y_n)\}$ are iid random elements with the
same distribution as the pair $(X,Y)$. The regression function is denoted by
$\eta(x) = E(Y|X=x) = P(Y=1|X=x)$, the Bayes rule by
$g^*(x) = \mathbb{I}_{\{\eta(x) > 1/2\}}$ and the optimal Bayes risk by
$L^* = P(g^*(X) \neq Y)$.
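In order to define our classifier, we split the sample $\mathcal{D}_n$ into two
subsamples $\mathcal{D}_k = \{(X_1,Y_1),\dots,(X_k,Y_k)\}$ and
$\mathcal{E}_l = \{(X_{k+1},Y_{k+1}),\dots,(X_n,Y_n)\}$ with $l = n-k \ge 1$.
With $\mathcal{D}_k$ we build up $M$ classifiers
$g_{mk}: \mathcal{F} \to \{0,1\}$, $m = 1,\dots,M$, which we place in the vector
$\mathbf{g_k}(x) = (g_{1k}(x),\dots,g_{Mk}(x))$ and, following some ideas in
[17], with $\mathcal{E}_l$ we construct our aggregate classifier as

$$ g_T(x) = \mathbb{I}_{\{T_n(\mathbf{g_k}(x)) > 1/2\}}, \tag{1} $$

where

$$ T_n(\mathbf{g_k}(x)) = \sum_{j=k+1}^{n} W_{n,j}(x)\, Y_j, \qquad x \in \mathcal{F}, \tag{2} $$

with weights $W_{n,j}(x)$ given by

$$ W_{n,j}(x) = \frac{\mathbb{I}_{\{\mathbf{g_k}(x) = \mathbf{g_k}(X_j)\}}}{\sum_{i=k+1}^{n} \mathbb{I}_{\{\mathbf{g_k}(x) = \mathbf{g_k}(X_i)\}}}. \tag{3} $$

Here, 0/0 is assumed to be 0. Like in [4], for $0 \le \alpha < 1$ a more flexible
version of the classifier, called $g_T(x,\alpha)$, can be defined by replacing
the weights in (3) by

$$ W_{n,j}(x) = \frac{\mathbb{I}_{\{\frac{1}{M}\sum_{m=1}^{M} \mathbb{I}_{\{g_{mk}(x) = g_{mk}(X_j)\}} \ge 1-\alpha\}}}{\sum_{i=k+1}^{n} \mathbb{I}_{\{\frac{1}{M}\sum_{m=1}^{M} \mathbb{I}_{\{g_{mk}(x) = g_{mk}(X_i)\}} \ge 1-\alpha\}}}. \tag{4} $$

More precisely, the more flexible version of the classifier (1) is given by

$$ g_T(x,\alpha) = \mathbb{I}_{\{T_n(\mathbf{g_k}(x),\alpha) > 1/2\}}, \tag{5} $$

where $T_n(\mathbf{g_k}(x),\alpha)$ is defined as in (2) but with the weights
given by (4). Observe that if we choose $\alpha = 0$ in (4) and (5) we obtain the
weights given in (3) and the classifier (1), respectively.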
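The aggregation rule of this section can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' code: the function name `aggregate_predict` and its array-based interface are assumptions made here. It takes the 0/1 decisions of the M experts (already fitted on the first subsample) evaluated at the new points and at the second subsample, and applies the weights with the convention 0/0 = 0.

```python
import numpy as np

def aggregate_predict(votes_new, votes_holdout, y_holdout, alpha=0.0):
    """Nonlinear aggregation rule g_T(x, alpha) (hypothetical helper).

    votes_new:     (n_new, M) array with the 0/1 decisions g_mk(x) of the
                   M experts at the points to classify.
    votes_holdout: (l, M) array with the decisions g_mk(X_j) on the second
                   subsample (j = k+1, ..., n).
    y_holdout:     (l,) array with the labels Y_j of the second subsample.
    alpha:         fraction of experts allowed to disagree (alpha = 0
                   requires unanimous agreement).
    """
    votes_new = np.asarray(votes_new)
    votes_holdout = np.asarray(votes_holdout)
    y_holdout = np.asarray(y_holdout, dtype=float)
    n_experts = votes_new.shape[1]

    # Number of experts that classify x and X_j in the same group.
    agree = (votes_new[:, None, :] == votes_holdout[None, :, :]).sum(axis=2)

    # X_j is a "voter" for x when at least a fraction 1 - alpha of the
    # experts agree; the small tolerance guards against rounding error.
    voters = (agree >= (1.0 - alpha) * n_experts - 1e-9).astype(float)

    numer = voters @ y_holdout          # sum of Y_j over the voters
    denom = voters.sum(axis=1)          # number of voters
    # T_n with the convention 0/0 = 0.
    t_n = np.divide(numer, denom, out=np.zeros_like(numer), where=denom > 0)
    return (t_n > 0.5).astype(int)
```

With `alpha=0.0` a point without any voter gets T_n = 0 and is therefore assigned to class 0, which is a direct consequence of the 0/0 convention.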

Remark 1. a) The type of nonlinear aggregation used to define our classifiers
turns out to be quite natural. Indeed, we give a weight different from
zero to those $X_j$ which are classified in the same group as $x$ by the
whole set of classifiers $\mathbf{g_k}$ (or by 100(1-α)% of them).

b) Since we are using the inverse functions of the classifiers $g_{mk}$,
observations which are far from $x$, but for which the condition mentioned
in a) is fulfilled, are involved in the definition of the classification
rule. This may be very important in the case of high dimensional data to
avoid the curse of dimensionality. This is illustrated in Figure 1, where
we show two samples of points: one uniformly distributed in the square
[-2, 2] × [-2, 2] (filled black points) and another uniformly distributed
in the $L_\infty$-ring [-2, 2] × [-1, 1] (empty black points). We also show
two points to classify, the empty red and the filled magenta triangles,
together with their corresponding voters, empty green squares and filled
blue squares, respectively. As we can see, observations that are far from
the triangles are also involved in the classification.
Figure 1: Left: Sample points corresponding to two populations (black filled
and empty circles) and two points to classify (red empty and magenta filled
triangles). Right: in empty green squares the voters for the red empty triangle
and in filled blue squares the voters for the magenta filled triangle.

3 Asymptotic results 

In this section we show two asymptotic results for the nonlinear aggregation
classifier. The first one shows that the classifier $g_T(X,\alpha)$ is
consistent if, for $0 \le \alpha < 0.5$, at least $R \ge (1-\alpha)M$ of the
experts are consistent. Moreover, rates of convergence for $g_T(X,\alpha)$ (and
$g_T(X)$) are obtained assuming we know the rates of convergence of the $R$
consistent experts. The second result shows that $g_T(X)$ behaves asymptotically
as well as the best of the $M$ classifiers used to build it up. Both results are
proved under mild conditions. Throughout this section we will use the notation
$P_{\mathcal{D}_k}(\cdot) = P(\cdot \,|\, \mathcal{D}_k)$.

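Theorem 1. Assume that, for every $m = 1,\dots,R$, the classifier $g_{mk}$
converges in probability to $g^*$ as $k \to \infty$, with $R \ge M(1-\alpha)$
and $\alpha \in [0,1/2)$. Let us assume that $P(Y=1|g^*(X)=1) > 1/2$ and
$P(Y=0|g^*(X)=0) > 1/2$, then

a) $\lim_{\min\{k,l\}\to\infty} P_{\mathcal{D}_k}(g_T(X,\alpha) \neq Y) - L^* = 0$.

b) Let $\beta_{mk} \to 0$ as $k \to \infty$, for $m = 1,\dots,R$, and
$\beta_{Rk} = \max_{m=1,\dots,R} \beta_{mk}$. If
$P_{\mathcal{D}_k}(g^*(X) \neq g_{mk}(X)) = \mathcal{O}(\beta_{mk})$, then, for
$k$ large enough,

$$ P_{\mathcal{D}_k}(g_T(X,\alpha) \neq Y) - L^* = \mathcal{O}\big(\max\{\exp(-Cl), \beta_{Rk}\}\big), \tag{6} $$

for some constant $C > 0$.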

Remark 2. a) The assumption

$$ 1)\ P(Y=1|g^*(X)=1) > 1/2 \qquad 2)\ P(Y=0|g^*(X)=0) > 1/2, \tag{7} $$

is really mild. It just requires that, if the Bayes rule $g^*(X)$ takes the
value 1 (or 0), the probability that $Y = 1$ is greater than the probability
that $Y = 0$ (the probability that $Y = 0$ is greater than the probability that
$Y = 1$). Moreover, since the Bayes risk $L^* < 1/2$, one of the conditions in
(7) is always fulfilled.

b) It is well known that in the finite dimensional case, if the regression
function $\eta$ verifies a Lipschitz condition and $X$ has bounded support, the
accuracy of classical classification rules is of a known order, which plays
the role of $\beta_{Rk}$ in (6). Therefore the right hand side of (6) is

$$ \mathcal{O}\big(\max\{\exp(-Cl), \beta_{Rk}\}\big), $$

and the optimal rate for $\max\{\exp(-Cl), \beta_{Rk}\}$ is attained for
$l \approx \log(k)$.

c) The choice of the parameters $\alpha$ and $l$ is an important issue. From a
practical point of view, we suggest performing a cross validation procedure
to select the values of these parameters. See Section 5 for an
implementation in a real data example.

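In order to state the optimality result we introduce some additional notation.
Let $C = \{0,1\}^M$ and let us call $\nu \in C$. Calling $\nu(m)$ the m-th entry
of the vector $\nu$, we define the following subsets

$$ A_\nu^0 = \bigcap_{m=1}^{M} g_{mk}^{-1}(\nu(m)) \times \{0\}, \qquad A_\nu^1 = \bigcap_{m=1}^{M} g_{mk}^{-1}(\nu(m)) \times \{1\}, $$

and $A_\nu = A_\nu^0 \cup A_\nu^1$.

For each $\nu \in C$, we consider the assumption:

(H) $H(\mathcal{D}_k) := P_{\mathcal{D}_k}((X,Y) \in A_\nu^1) - P_{\mathcal{D}_k}((X,Y) \in A_\nu^0) \neq 0$ a.s.

Theorem 2. 1) For each $m = 1,\dots,M$,

$$ P_{\mathcal{D}_k}(g_T(X) \neq Y) - P_{\mathcal{D}_k}(g_{mk}(X) \neq Y) \le \mathcal{O}_k(l^{-1/2}), $$

which implies that

$$ \lim_{l\to\infty} P_{\mathcal{D}_k}(g_T(X) \neq Y) \le \min_{1 \le m \le M} P_{\mathcal{D}_k}(g_{mk}(X) \neq Y). $$

2) Under assumption (H) we obtain a better approximation rate,

$$ P_{\mathcal{D}_k}(g_T(X) \neq Y) - P_{\mathcal{D}_k}(g_{mk}(X) \neq Y) \le \mathcal{O}_k(\exp(-K_1 l)). $$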
4 A small simulation study 

In this section we present the performance of the aggregated classifier in two
different scenarios. The first one corresponds to high dimensional data while, in
the second one, we consider two simulated models for functional data analyzed
in Delaigle and Hall [16].
n/k       g_T(·)       g_1n         g_2n         g_3n         g_4n         g_5n         g_6n         g_7n         g_8n
400/300   [illegible]  .045 (.016)  .043 (.017)  .042 (.017)  .042 (.018)  .043 (.018)  .043 (.018)  .043 (.019)  .044 (.019)
600/400   .023 (.012)  .039 (.015)  .036 (.016)  .035 (.015)  .035 (.015)  .035 (.015)  .036 (.015)  .037 (.016)  .037 (.016)
800/600   .020 (.010)  .037 (.014)  .034 (.013)  .033 (.013)  .033 (.013)  .033 (.013)  .033 (.013)  .033 (.013)  .033 (.014)

Table 1: Mean and standard deviation (in brackets) of the misclassification
error rate over 500 replicates with fixed number of neighbors.


n/k       g_T(·)       g_1n         g_2n         g_3n         g_4n         g_5n         g_6n         g_7n         g_8n
400/300   .025 (.015)  .045 (.015)  .040 (.015)  .040 (.015)  .040 (.015)  .040 (.015)  .040 (.015)  .040 (.015)  .040 (.015)
600/400   .020 (.015)  .035 (.015)  .035 (.015)  .035 (.015)  .035 (.015)  .035 (.015)  .035 (.015)  .035 (.015)  .035 (.015)
800/600   .020 (.007)  .035 (.015)  .035 (.015)  .030 (.015)  .030 (.015)  .030 (.015)  .030 (.015)  .032 (.011)  .030 (.015)

Table 2: Median and MAD (in brackets) of the misclassification error rate over
500 replicates with fixed number of neighbors.


High dimensional setting 

In this setting we show the performance of our method by analyzing data
generated in a high dimensional Euclidean space $\mathbb{R}^d$ in the following
way: we generate $n + 200$ iid uniform random variables in [0,1], say
$Z_1,\dots,Z_{n+200}$. For each $i = 1,\dots,n+200$, if $Z_i > 1/4$, we generate
a random variable $X_i \in \mathbb{R}^d$ with uniform distribution in
$[-2,2]^d$ and set $Y_i = 1$. If $Z_i \le 1/4$, we generate a random variable
$X_i \in \mathbb{R}^d$ with uniform distribution in $T_v([-2,2]^d)$, where $T_v$
is the translation along the direction $(v,\dots,v) \in \mathbb{R}^d$ for
$v = 1/4$, and set $Y_i = 0$. Then we split the sample into two subsamples: with
the first $n$ pairs $(X_i, Y_i)$ we build the training sample, and with the
remaining 200 we build the testing sample. We consider two cases: the
homogeneous case, where we aggregate classifiers of the same nature, and the
heterogeneous case, where we aggregate experts of different nature.
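The sampling scheme just described can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `simulate_shifted_cubes`, the `seed` argument and the explicit dimension argument `d` are assumptions introduced here.

```python
import numpy as np

def simulate_shifted_cubes(n, d, shift=0.25, n_test=200, seed=0):
    """Draw n training and n_test testing pairs (hypothetical helper).

    Class 1 (probability 3/4): X uniform on the cube [-2, 2]^d.
    Class 0 (probability 1/4): the same cube translated by `shift`
    along the direction (1, ..., 1).
    """
    rng = np.random.default_rng(seed)
    total = n + n_test

    # Z_i uniform on [0, 1] decides the label: Z_i > 1/4 gives Y_i = 1.
    z = rng.uniform(0.0, 1.0, size=total)
    y = (z > 0.25).astype(int)

    # Uniform points on the cube, then translate the class-0 points.
    x = rng.uniform(-2.0, 2.0, size=(total, d))
    x[y == 0] += shift

    # First n pairs: training sample; remaining n_test pairs: testing sample.
    return (x[:n], y[:n]), (x[n:], y[n:])
```

For example, `simulate_shifted_cubes(400, d=10)` returns a training sample of 400 points and a testing sample of 200 points in dimension 10; the dimension is left as an argument because it is a free parameter of this sketch.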

• Homogeneous case: M k-nearest neighbor classifiers with the number of
neighbors taken as follows:

1. we fix M = 8 consecutive odd numbers; 

2. we choose at random M = 10 different odd integers between 1 and 

In Table 1 we report the mean and standard deviation (in brackets) of the
misclassification error rate for case 1, compared with the nearest neighbor
rules built up with a sample of size n taking 5, 7, 9, 11, 13, 15, 17, 19 nearest
neighbors (these classifiers are denoted by $g_{mn}$ for $m = 1,\dots,8$). In
Table 2 we report the median and MAD (in brackets) of the misclassification
error rate for this case.
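The four summaries reported in the tables can be computed from the vector of error rates obtained over the replicates, as in the sketch below. The helper name `summarize_error_rates` is hypothetical, and the scaling constant 1.4826 for the MAD is an assumption (it is the usual normal-consistency convention; the text does not state which version of the MAD is used).

```python
import numpy as np

def summarize_error_rates(errors, mad_scale=1.4826):
    """Mean, standard deviation, median and MAD of replicated error rates.

    errors:    one misclassification error rate per replicate.
    mad_scale: constant multiplying the median absolute deviation
               (assumed here; use 1.0 for the unscaled version).
    """
    errors = np.asarray(errors, dtype=float)
    median = np.median(errors)
    mad = mad_scale * np.median(np.abs(errors - median))
    return {
        "mean": errors.mean(),
        "sd": errors.std(ddof=1),   # sample standard deviation
        "median": median,
        "mad": mad,
    }
```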

In Table 3 we report the mean and standard deviation of the misclassification
error rate for case 2 with the original aggregated classifier and the two more
flexible versions, $\alpha = 1/8$ and $\alpha = 1/4$. In this table we compare
the performance of our rules with the (optimal) cross validated nearest neighbor
classifier computed with $k$ and also with $n$ observations. In Table 4 we report
the median and MAD of the misclassification error rate for this case.

• Heterogeneous case: M = 5 classifiers: 3 k-nearest neighbor rules with
fixed values of k, the Fisher and the random forest classifiers.

Here we take 3, 5, 7 nearest neighbors (denoted by $g_{mn}$ for $m = 1,2,3$), the
Fisher classifier (denoted by $g_F$) and the random forest classifier (denoted by
$g_{RF}$). In Table 5 we report the averaged misclassification error rates and
standard deviation, and in Table 6 we report the median and MAD for this case.


Functional data setting 

In this setting we show the performance of our method by analyzing the following 
two models considered in Delaigle and Hall [16] : 

• Model I: We generate two samples of size n/2 from different populations
following the model

$$ X_{pi}(t) = \sum_{j=1}^{6} \mu_{p,j}\,\phi_j(t) + e_{pi}(t), \qquad p = 1,2, \quad i = 1,\dots,n/2, $$

where $\phi_j(t) = \sqrt{2}\sin(\pi j t)$, $\mu_{1,j}$ and $\mu_{2,j}$ are,
respectively, the j-th coordinate of the mean vectors
$\mu_1 = (0, -0.5, 1, -0.5, 1, -0.5)$, and $\mu_2 =$


n/k       g_T(·)       g_T(·,1/8)   g_T(·,1/4)   g_CVn        g_CVk
400/300   .029 (.016)  .038 (.019)  .046 (.021)  .040 (.017)  .044 (.018)
600/400   .029 (.016)  .039 (.019)  .047 (.022)  .037 (.016)  .043 (.018)
800/600   .027 (.014)  .036 (.018)  .046 (.020)  .033 (.014)  .036 (.015)

Table 3: Mean and standard deviation (in brackets) of the misclassification
error rate over 500 replicates with the number of neighbors chosen at random.


n/k       g_T(·)       g_T(·,1/8)   g_T(·,1/4)   g_CVn        g_CVk
400/300   .025 (.015)  .035 (.015)  .045 (.022)  .040 (.015)  .042 (.019)
600/400   .028 (.019)  .035 (.015)  .045 (.022)  .035 (.015)  .040 (.015)
800/600   .025 (.015)  .035 (.015)  .045 (.022)  .035 (.015)  .035 (.015)

Table 4: Median and MAD (in brackets) of the misclassification error rate over
500 replicates with the number of neighbors chosen at random.
n/k       g_T(·)       g_1n         g_2n         g_3n         g_F          g_RF
400/300   .012 (.011)  .049 (.016)  .043 (.017)  .041 (.017)  .020 (.011)  .004 (.004)
600/400   .008 (.007)  .047 (.015)  .040 (.015)  .037 (.015)  .012 (.008)  .001 (.002)
800/600   .007 (.007)  .043 (.015)  .036 (.015)  .034 (.014)  .009 (.007)  .000 (.002)

Table 5: Mean and standard deviation (in brackets) of the misclassification
error rate over 500 replicates with fixed number of neighbors, Fisher
classifier and random forest.


n/k       g_T(·)       g_1n         g_2n         g_3n         g_F          g_RF
400/300   .010 (.007)  .050 (.015)  .040 (.015)  .040 (.015)  .020 (.015)  .000 (.000)
600/400   .005 (.007)  .045 (.015)  .040 (.015)  .035 (.015)  .010 (.007)  .000 (.000)
800/600   .005 (.007)  .040 (.015)  .035 (.015)  .035 (.015)  .010 (.007)  .000 (.000)

Table 6: Median and MAD (in brackets) of the misclassification error rate over
500 replicates with fixed number of neighbors, Fisher classifier and random
forest.

$(0, -0.75, 0.75, -0.15, 1.4, 0.1)$, while the errors are given by

$$ e_{pi}(t) = \sum_{j=1}^{40} \sqrt{\theta_j}\, Z_{pj}\,\phi_j(t), \qquad p = 1,2, $$

with $Z_{pj} \sim \mathcal{N}(0,1)$ and $\theta_j = 1/j^2$.

• Model II: We generate two samples of size n/2 from different populations 




Figure 2: Mean curve (Left) and Error curve (Right) of the two populations of
Model II.

Model   g_T(·)       g_T(·,1/5)   g_T(·,2/5)   g_T(·,3/5)   g_1n         g_2n         g_3n         g_4n         g_5n
I       .013 (.011)  .005 (.005)  .004 (.005)  .005 (.006)  .017 (.011)  .007 (.007)  .004 (.004)  .003 (.004)  .002 (.003)
II      .110 (.029)  .074 (.019)  .068 (.018)  .069 (.018)  .124 (.030)  .083 (.020)  .070 (.018)  .066 (.016)  [illegible]

Table 7: Mean and standard deviation (in brackets) of the misclassification
error rate over 200 replicates for models I and II.


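following the model

$$ X_{pi}(t) = \sum_{j=1}^{3} \mu_{p,j}\,\phi_j(t) + e_{pi}(t), \qquad p = 1,2, \quad i = 1,\dots,n/2, $$

where $\mu_{1,j}$ is the j-th coordinate of $\mu_1 = 0.75 \cdot (1,-1,1)$ and
$\mu_{2,j}$ the j-th coordinate of $\mu_2 = 0$, $\theta_j = 1/j^2$ and the errors
are given by

$$ e_{pi}(t) = \sum_{j=1}^{40} \sqrt{\theta_j}\, Z_{pj}\,\phi_j(t), \qquad p = 1,2, $$

with $Z_{pj} \sim \mathcal{N}(0,1)$ and $\theta_j = \exp\{-(2.1 - (j-1)/20)^2\}$.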

This second model looks more challenging since, although the means of the
two populations are quite different, the error process is very wiggly and
concentrated in high frequencies (as shown in Figure 2, left and right
panels, respectively). So in this case, in order to apply our classification
method, we have first applied the Nadaraya-Watson kernel smoother (taking a
normal kernel) to the training sample, with different values of the
bandwidth for each of the two populations. The values of the bandwidths
were chosen via cross-validation with our classifier, varying the bandwidths
between .1 and .7 (in steps of length .05). The optimal values, over
200 replicates, were $h_1 = .15$ for the first population (with mean $\mu_1$)
and $h_2 = .7$ for the second one. Finally, we apply the classification
method to the raw (non-smoothed) curves of the testing sample.
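As an illustration, curves from Model I above can be simulated on a grid with the sketch below. The basis $\phi_j(t) = \sqrt{2}\sin(\pi j t)$ and the variances $\theta_j = 1/j^2$ are taken as read from the model description; the function name `simulate_model_one`, the grid of 100 equispaced points on [0, 1], the 0/1 coding of the two populations and the `seed` argument are assumptions made here.

```python
import numpy as np

# Mean coefficient vectors of the two populations of Model I.
MU_1 = np.array([0.0, -0.5, 1.0, -0.5, 1.0, -0.5])
MU_2 = np.array([0.0, -0.75, 0.75, -0.15, 1.4, 0.1])

def simulate_model_one(n, grid_size=100, n_terms=40, seed=0):
    """Simulate n/2 curves per population of Model I (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, grid_size)
    j = np.arange(1, n_terms + 1)

    # Basis functions phi_j(t) evaluated on the grid: shape (n_terms, grid_size).
    phi = np.sqrt(2.0) * np.sin(np.pi * np.outer(j, t))
    theta = 1.0 / j**2

    curves, labels = [], []
    for label, mu in ((0, MU_1), (1, MU_2)):
        mean_curve = mu @ phi[: len(mu)]                 # sum_j mu_{p,j} phi_j(t)
        z = rng.standard_normal((n // 2, n_terms))       # Z_{pj} ~ N(0, 1)
        noise = (z * np.sqrt(theta)) @ phi               # sum_j sqrt(theta_j) Z_{pj} phi_j(t)
        curves.append(mean_curve + noise)
        labels.append(np.full(n // 2, label))
    return t, np.vstack(curves), np.concatenate(labels)
```

Calling `simulate_model_one(90)` gives 45 discretized curves per population, which matches the training size n = 90 used in the experiments; the resulting matrix can be passed to any classifier that works on discretized curves.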

In Table 7 we report the averaged misclassification error rate and the standard
deviation over 200 replications for models I and II, taking n = 90, k = 60,
l = 30, and $\alpha = 0, 1/5, 2/5, 3/5$. In the whole training sample (of n
functions) the n/2 labels for every population were chosen at random. The test
sample consists of 200

Model   g_T(·)       g_T(·,1/5)   g_T(·,2/5)   g_T(·,3/5)   g_1n         g_2n         g_3n         g_4n         g_5n
I       .010 (.007)  .005 (.007)  .004 (.007)  .005 (.007)  .015 (.007)  .005 (.007)  .000 (.000)  .000 (.000)  .000 (.000)
II      .105 (.030)  .070 (.015)  .065 (.015)  .070 (.015)  .120 (.030)  .080 (.022)  .070 (.022)  .065 (.015)  .065 (.015)

Table 8: Median and MAD (in brackets) of the misclassification error rate over
200 replicates for models I and II.


data, taking 100 of every population. Here, $g_{mn}$ is the $(2m-1)$-nearest
neighbor rule for $m = 1,\dots,5$. In Table 8 we report the median of the
misclassification error rate and the MAD. For Model I we get a better
performance than the PLS-Centroid Classifier proposed by Delaigle and Hall [16].
For Model II the PLS-Centroid Classifier clearly outperforms our classifier,
although we get a quite small misclassification error just using a combination
of five nearest neighbor estimates.


5 A real data example: Analysis of spectrograms 




Figure 3: Spectrogram of a healthy woman (left panel) and of a woman suffering
from ovarian cancer (right panel).

The data analyzed in this section consist of the mass spectra from blood
samples of 216 women, of whom 121 suffer from an ovarian cancer condition and
the remaining 95 are healthy women who were taken as the control group. We
refer to [2] for a previous analysis of these data with a detailed discussion of
their medical aspects; see also [12] for further statistical analysis of these
data. A spectrogram is a curve showing the number of molecules (or fragments)
found for every mass/charge ratio. The idea behind spectrograms is to monitor
the amount of proteins produced in cells since, when cancer starts to grow,
its cells produce different kinds of proteins than those produced by healthy
cells. Moreover, the amount of the proteins produced in common may be different.
Proteomics, broadly speaking, consists of a family of procedures allowing
researchers to analyze proteins. In particular, here we are interested in some
techniques which make it possible to separate mixtures of complex molecules
according to the mass/charge ratio (observe that molecules with the same
mass/charge ratio are indistinguishable in a spectrogram).

We have processed the data as follows: we have restricted ourselves to the
mass/charge interval [7000, 9500] (horizontal axis). Then, in order to have all
the spectra defined on a common equi-spaced grid, we have smoothed them via
a Nadaraya-Watson smoother. Finally, every function has been divided by its
maximum, in order to have all the values scaled to the common interval [0,1].
Observe that our interest lies in the locations of the maxima of the amount of
molecules rather than in the corresponding heights.
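The three preprocessing steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function names, the Gaussian kernel, the bandwidth value and the grid size are hypothetical choices, since the text does not report them for this data set.

```python
import numpy as np

def nadaraya_watson(x_obs, y_obs, x_grid, bandwidth):
    """Nadaraya-Watson estimate on x_grid with a Gaussian kernel (assumed)."""
    u = (x_grid[:, None] - x_obs[None, :]) / bandwidth
    weights = np.exp(-0.5 * u**2)
    # Weighted average of the observations around each grid point.
    return (weights @ y_obs) / weights.sum(axis=1)

def preprocess_spectrum(mz, intensity, lo=7000.0, hi=9500.0,
                        grid_size=500, bandwidth=5.0):
    """Restrict to [lo, hi], smooth onto a common grid, scale to [0, 1].

    grid_size and bandwidth are illustrative values only.
    """
    mz = np.asarray(mz, dtype=float)
    intensity = np.asarray(intensity, dtype=float)

    # Step 1: keep the observations inside the mass/charge window.
    keep = (mz >= lo) & (mz <= hi)

    # Step 2: evaluate the smoother on a common equi-spaced grid.
    grid = np.linspace(lo, hi, grid_size)
    smooth = nadaraya_watson(mz[keep], intensity[keep], grid, bandwidth)

    # Step 3: divide by the maximum so that the curve takes values in [0, 1].
    return grid, smooth / smooth.max()
```

Because every spectrum is evaluated on the same grid, the output vectors of different women are directly comparable; the scaling step keeps the locations of the peaks and discards their absolute heights.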

To build the classifier $g_T(x,\alpha)$ introduced in Section 2 we have taken 5
nearest neighbor classifiers, with k = 3, 5, 7, 9 neighbors. We have implemented
the cross validation method on a grid for $(\alpha, l)$, with $\alpha$ taking
the values 0, 1/5, 2/5 and $l$ taking 60 values $l = 20, 21, \dots, 80$. The
minimum of the misclassification error was attained for $\alpha = 2/5$ and
$l = 31, \dots, 36$, in which case the accuracy obtained was 95%.
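A grid search of this kind can be sketched as follows. The text does not specify the cross validation scheme, so this sketch assumes the simplest variant, a single held-out validation set; the function names, the plain NumPy k-nearest neighbor rule and the Euclidean distance between discretized curves are all assumptions made for illustration.

```python
import numpy as np

def knn_predict(x_train, y_train, x_new, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    dist = ((x_new[:, None, :] - x_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]
    return (y_train[nearest].mean(axis=1) > 0.5).astype(int)

def aggregate_predict(votes_new, votes_holdout, y_holdout, alpha):
    """Aggregated rule g_T(x, alpha) with the convention 0/0 = 0."""
    n_experts = votes_new.shape[1]
    agree = (votes_new[:, None, :] == votes_holdout[None, :, :]).sum(axis=2)
    voters = (agree >= (1.0 - alpha) * n_experts - 1e-9).astype(float)
    numer, denom = voters @ y_holdout, voters.sum(axis=1)
    t_n = np.divide(numer, denom, out=np.zeros_like(numer), where=denom > 0)
    return (t_n > 0.5).astype(int)

def select_alpha_l(x, y, x_val, y_val, neighbors, alphas, ls):
    """Return (error, alpha, l) minimizing the validation error on the grid."""
    best = None
    n = len(y)
    for l in ls:
        k = n - l  # first k points fit the experts, last l points vote
        votes_hold = np.column_stack(
            [knn_predict(x[:k], y[:k], x[k:], nn) for nn in neighbors])
        votes_val = np.column_stack(
            [knn_predict(x[:k], y[:k], x_val, nn) for nn in neighbors])
        for alpha in alphas:
            pred = aggregate_predict(votes_val, votes_hold,
                                     y[k:].astype(float), alpha)
            err = float(np.mean(pred != y_val))
            if best is None or err < best[0]:
                best = (err, alpha, l)
    return best
```

Ties are broken in favor of the first grid point visited, so when several pairs reach the same error (as happens for l = 31, ..., 36 in the analysis above) only one of them is returned.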


6 Concluding remarks 

• We introduce a new nonlinear aggregating method for supervised
classification in a general setup built up from a family of classifiers
$g_{1k},\dots,g_{Mk}$. It combines the decisions of the M experts according
to a “coincidence opinion” with respect to the new data we want to classify.

• The new method, besides being easy to implement, is particularly well
designed for high dimensional and functional data. The method is not
local, and the use of the inverse functions helps to avoid the curse of
dimensionality from which all local methods suffer.

• We obtain consistency and rates of convergence under very mild conditions 
on a general metric space setup. 

• An optimality result is obtained in the sense that the nonlinear aggregation
rule behaves asymptotically as well as the best one among the M classifiers
(experts) $g_{1k},\dots,g_{Mk}$.

• A small simulation study confirms the asymptotic results for moderate
sample sizes. In particular, the method is very well behaved for
high-dimensional and functional data.

• On a well known dataset of spectrogram curves, we obtain a very good
performance, correctly classifying 95% of the curves, very close to the best
known results for these data.

• Although we have implemented cross validation to choose the parameters
$(\alpha, l)$ in Section 5, conditions for the validity of this procedure
remain an open problem.

7 Appendix: Proof of results 

To prove Theorem 1 we will need the following Lemma.

Lemma 3. Let $f(x)$ be a classifier built up from the training sample
$\mathcal{D}_k$ such that $P_{\mathcal{D}_k}(f(X) \neq g^*(X)) \to 0$ when
$k \to \infty$. Then, $P_{\mathcal{D}_k}(f(X) \neq Y) - L^* \to 0$.
Proof of Lemma 3. First we write,

$$ \begin{aligned}
P_{\mathcal{D}_k}(f(X) \neq Y) - L^* &= P_{\mathcal{D}_k}(f(X) \neq Y) - P(g^*(X) \neq Y) \\
&= P_{\mathcal{D}_k}(f(X) \neq Y, Y = g^*(X)) + P_{\mathcal{D}_k}(f(X) \neq Y, Y \neq g^*(X)) - P(g^*(X) \neq Y) \\
&= P_{\mathcal{D}_k}(f(X) \neq g^*(X), Y = g^*(X)) + P_{\mathcal{D}_k}(f(X) \neq Y, Y \neq g^*(X)) - P(g^*(X) \neq Y) \\
&= P_{\mathcal{D}_k}(f(X) \neq g^*(X), Y = g^*(X)) - P_{\mathcal{D}_k}(g^*(X) \neq Y, f(X) = Y),
\end{aligned} \tag{8} $$

where in the last equality we have used that

$$ P(g^*(X) \neq Y) = P_{\mathcal{D}_k}(g^*(X) \neq Y, f(X) \neq Y) + P_{\mathcal{D}_k}(g^*(X) \neq Y, f(X) = Y) $$

implies

$$ P_{\mathcal{D}_k}(g^*(X) \neq Y, f(X) \neq Y) - P(g^*(X) \neq Y) = -P_{\mathcal{D}_k}(g^*(X) \neq Y, f(X) = Y). $$

Therefore, from (8) we get that

$$ P_{\mathcal{D}_k}(f(X) \neq Y) - L^* \le P_{\mathcal{D}_k}(f(X) \neq g^*(X)), \tag{9} $$

which by hypothesis converges to zero as $k \to \infty$, and the Lemma is proved. □

Proof of Theorem 1. We will prove part b) of the Theorem since part a) is a
direct consequence of it. By (9), it suffices to prove that, for $k$ large enough,

$$ P_{\mathcal{D}_k}(g_T(X,\alpha) \neq g^*(X)) = \mathcal{O}\big(\max\{\exp(-C(n-k)), \beta_{Rk}\}\big). $$

We first split $P_{\mathcal{D}_k}(g_T(X,\alpha) \neq g^*(X))$ into two terms,

$$ \begin{aligned}
P_{\mathcal{D}_k}(g_T(X,\alpha) \neq g^*(X)) &= P_{\mathcal{D}_k}(g_T(X,\alpha) \neq g^*(X), g^*(X) = 1) \\
&\quad + P_{\mathcal{D}_k}(g_T(X,\alpha) \neq g^*(X), g^*(X) = 0) = I + II.
\end{aligned} $$

Then we will prove that, for $k$ large enough,

$$ I = \mathcal{O}\big(\max\{\exp(-C_1(n-k)), \beta_{Rk}\}\big), $$

for some constant $C_1$. The proof that

$$ II = \mathcal{O}\big(\max\{\exp(-C_2(n-k)), \beta_{Rk}\}\big), $$

for some constant $C_2$ is completely analogous and we omit it. Finally,
taking $C = \min\{C_1, C_2\}$, the proof will be completed. In order to deal with
term $I$, let us define the vectors

$$ \mathbf{g_{Rk}}(X) = (g_{1k}(X),\dots,g_{Rk}(X)) \in \{0,1\}^R, $$

$$ \nu(X) = (1,\dots,1,g_{(R+1)k}(X),\dots,g_{Mk}(X)) \in \{0,1\}^M. $$
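Then, writing $\mathbf{g_{Rk}}(X) = \mathbf{1}$ for the event that the first
$R$ classifiers all take the value 1,

$$ \begin{aligned}
I &= P_{\mathcal{D}_k}(g_T(X,\alpha) \neq g^*(X), g^*(X) = 1) \\
&\le P_{\mathcal{D}_k}(g_T(X,\alpha) \neq g^*(X), g^*(X) = 1, \mathbf{g_{Rk}}(X) = \mathbf{1}) \\
&\quad + \sum_{m=1}^{R} P_{\mathcal{D}_k}(g_T(X,\alpha) \neq g^*(X), g^*(X) = 1, g_{mk}(X) = 0) \\
&\le P_{\mathcal{D}_k}(g_T(X,\alpha) \neq g^*(X), g^*(X) = 1, \mathbf{g_{Rk}}(X) = \mathbf{1}) + \sum_{m=1}^{R} P_{\mathcal{D}_k}(g^*(X) \neq g_{mk}(X)) \\
&\le P_{\mathcal{D}_k}\big(T_n(\mathbf{g_k}(X),\alpha) \le 1/2 \,\big|\, g^*(X) = 1, \mathbf{g_{Rk}}(X) = \mathbf{1}\big) + \sum_{m=1}^{R} P_{\mathcal{D}_k}(g^*(X) \neq g_{mk}(X)) \\
&= I_A + I_B.
\end{aligned} $$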


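Observe that, conditioning on $\mathbf{g_{Rk}}(X) = \mathbf{1}$ and defining

$$ Z_j = \mathbb{I}_{\{\frac{1}{M}\sum_{m=1}^{M} \mathbb{I}_{\{g_{mk}(X_j) = \nu(m)\}} \ge 1-\alpha\}}, $$

we can rewrite $T_n(\mathbf{g_k}(X),\alpha)$ as

$$ T_n(\mathbf{g_k}(X),\alpha) = \frac{\sum_{j=k+1}^{n} Z_j Y_j}{\sum_{i=k+1}^{n} Z_i}. $$

Therefore,

$$ \begin{aligned}
I_A &= P_{\mathcal{D}_k}\left( \frac{\frac{1}{n-k}\sum_{j=k+1}^{n} Z_j Y_j}{\frac{1}{n-k}\sum_{i=k+1}^{n} Z_i} \le \frac{1}{2} \,\middle|\, g^*(X) = 1, \mathbf{g_{Rk}}(X) = \mathbf{1} \right) \\
&= P_{\mathcal{D}_k}\left( \frac{1}{n-k}\sum_{j=k+1}^{n} Z_j (Y_j - 1/2) \le 0 \,\middle|\, g^*(X) = 1, \mathbf{g_{Rk}}(X) = \mathbf{1} \right).
\end{aligned} \tag{10} $$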

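In order to use a concentration inequality to bound this probability, we need to
compute the expectation of $Z_j(Y_j - 1/2) = Z_j Y_j - Z_j/2$. To do this, observe
that

$$ E(Z_j Y_j) = P_{\mathcal{D}_k}\left( \frac{1}{M}\sum_{m=1}^{M} \mathbb{I}_{\{g_{mk}(X) = \nu(m)\}} \ge 1-\alpha,\ Y = 1 \right) $$

and

$$ E(Z_j) = P_{\mathcal{D}_k}\left( \frac{1}{M}\sum_{m=1}^{M} \mathbb{I}_{\{g_{mk}(X) = \nu(m)\}} \ge 1-\alpha \right). $$

Since

$$ \{\mathbf{g_{Rk}}(X) = \mathbf{1}\} \subset \left\{ \frac{1}{M}\sum_{m=1}^{M} \mathbb{I}_{\{g_{mk}(X) = \nu(m)\}} \ge 1-\alpha \right\} =: A_\alpha, $$

we have,

$$ \begin{aligned}
E(Z_j Y_j) - E(Z_j)/2 &= P_{\mathcal{D}_k}(Y = 1, A_\alpha) - P_{\mathcal{D}_k}(A_\alpha)/2 \\
&= P_{\mathcal{D}_k}(A_\alpha)\big[P_{\mathcal{D}_k}(Y = 1 | A_\alpha) - 1/2\big] \\
&\ge P_{\mathcal{D}_k}(\mathbf{g_{Rk}}(X) = \mathbf{1})\big[P_{\mathcal{D}_k}(Y = 1 | A_\alpha) - 1/2\big].
\end{aligned} \tag{11} $$

Now, since for $m = 1,\dots,R$, $g_{mk} \to g^*$ in probability as $k \to \infty$,

$$ P_{\mathcal{D}_k}(\mathbf{g_{Rk}}(X) = \mathbf{1}) \to P(g^*(X) = 1) = p^* > 0. \tag{12} $$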

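On the other hand, we have that, for $k$ large enough,
$P_{\mathcal{D}_k}(Y = 1 | A_\alpha) > 1/2$. Indeed, for $m = 1,\dots,R$, let us
consider the events $B_{mk} = \{g_{mk}(X) = g^*(X)\}$ which, by hypothesis, for
$k$ large enough verify

$$ P_{\mathcal{D}_k}\Big(\bigcap_{m=1}^{R} B_{mk}\Big) > 1 - \varepsilon, $$

for all $\varepsilon > 0$ (since $R$ is fixed and each $P_{\mathcal{D}_k}(B_{mk})$
tends to 1). In particular, we can take $\varepsilon > 0$ such that
$P(Y = 1 | g^*(X) = 1)(1 - \varepsilon) > 1/2$. This implies that

$$ \begin{aligned}
P_{\mathcal{D}_k}(Y = 1 | A_\alpha) &= \frac{P_{\mathcal{D}_k}(Y = 1, A_\alpha)}{P_{\mathcal{D}_k}(A_\alpha)} \\
&\ge \frac{P_{\mathcal{D}_k}\big(Y = 1, A_\alpha, \bigcap_{m=1}^{R} B_{mk}\big)}{P_{\mathcal{D}_k}(A_\alpha)} \\
&\ge \frac{P_{\mathcal{D}_k}\big(Y = 1, A_\alpha \,\big|\, \bigcap_{m=1}^{R} B_{mk}\big)}{P_{\mathcal{D}_k}(A_\alpha)}\,(1 - \varepsilon).
\end{aligned} \tag{13} $$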


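Conditioning on $\bigcap_{m=1}^{R} B_{mk}$, the event $A_\alpha$ equals
$C_\alpha$ given by

$$ C_\alpha = \left\{ R\,\mathbb{I}_{\{g^*(X) = 1\}} + \sum_{m=R+1}^{M} \mathbb{I}_{\{g_{mk}(X) = \nu(m)\}} \ge M(1-\alpha) \right\}. \tag{14} $$

However, $\alpha < 1/2$ implies that $C_\alpha = \{g^*(X) = 1\}$. Indeed, from the
inequality $R \ge M(1-\alpha)$, it is clear that $\{g^*(X) = 1\} \subset C_\alpha$.
On the other hand, $R \ge M(1-\alpha) > M/2$ and $\alpha < 1/2$ imply that
$M - R < M/2 < M(1-\alpha)$, and so the sum in the second term of (14) is at
most $M - R$ and consequently, $\{g^*(X) = 1\}^c \subset C_\alpha^c$. Then,
combining this fact with (13) we have that, for $k$ large enough,

$$ \begin{aligned}
P_{\mathcal{D}_k}(Y = 1 | A_\alpha) &\ge \frac{P_{\mathcal{D}_k}\big(Y = 1, g^*(X) = 1 \,\big|\, \bigcap_{m=1}^{R} B_{mk}\big)}{P_{\mathcal{D}_k}(g^*(X) = 1)}\,(1 - \varepsilon) \\
&\ge \frac{P_{\mathcal{D}_k}(Y = 1, g^*(X) = 1)}{P_{\mathcal{D}_k}(g^*(X) = 1)}\,(1 - \varepsilon) \\
&= P(Y = 1 | g^*(X) = 1)(1 - \varepsilon) > 1/2.
\end{aligned} \tag{15} $$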
Therefore, from (12) and (15) in (11) we get

$$ E(Z_j Y_j) - E(Z_j)/2 > c > 0. $$

Going back to (10), conditioning on $\nu(X)$ and using the Hoeffding inequality
for $|Z_j(Y_j - 1/2)| \le 1/2$, for $k$ large enough we have

$$ \begin{aligned}
I_A &\le P_{\mathcal{D}_k}\left( \frac{1}{n-k}\sum_{j=k+1}^{n} \Big( Z_j(Y_j - 1/2) - E\big(Z_j(Y_j - 1/2)\big) \Big) \le -c \,\middle|\, g^*(X) = 1, \mathbf{g_{Rk}}(X) = \mathbf{1} \right) \\
&\le \exp\{-C_1(n-k)\},
\end{aligned} $$

with $C_1 = 2c^2$. On the other hand, by hypothesis we have

$$ I_B = \sum_{m=1}^{R} P_{\mathcal{D}_k}(g^*(X) \neq g_{mk}(X)) = \mathcal{O}(\beta_{Rk}), $$

which concludes the proof. □
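For reference, the one-sided form of Hoeffding's inequality used in the last step can be written out as follows. The symbols $W_j$, $a_j$ and $b_j$ are shorthand introduced here for illustration and do not appear in the proof itself.

```latex
% Hoeffding's inequality for independent W_1, ..., W_l with W_j in [a_j, b_j]
\[
P\Big(\frac{1}{l}\sum_{j=1}^{l}\big(W_j - E(W_j)\big) \le -c\Big)
\le \exp\Big(-\frac{2\,l^{2}c^{2}}{\sum_{j=1}^{l}(b_j-a_j)^{2}}\Big),
\qquad c > 0 .
\]
% With W_j = Z_j (Y_j - 1/2) in [-1/2, 1/2] we have b_j - a_j = 1 and l = n - k, so
\[
\exp\Big(-\frac{2\,l^{2}c^{2}}{l}\Big) = \exp\{-2c^{2}(n-k)\},
\]
% which gives the constant C_1 = 2c^2.
```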

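Proof of Theorem 2. First we write,

$$ \begin{aligned}
P_{\mathcal{D}_k}(g_T(X) \neq Y) &= P_{\mathcal{D}_k}\big(T_n(\mathbf{g_k}(X)) > 1/2, Y = 0\big) + P_{\mathcal{D}_k}\big(T_n(\mathbf{g_k}(X)) \le 1/2, Y = 1\big) \\
&= \sum_{\nu \in C} P_{\mathcal{D}_k}\big(T_n(\mathbf{g_k}(X)) > 1/2, (X,Y) \in A_\nu^0\big) \\
&\quad + \sum_{\nu \in C} P_{\mathcal{D}_k}\big(T_n(\mathbf{g_k}(X)) \le 1/2, (X,Y) \in A_\nu^1\big) \\
&= I + II.
\end{aligned} \tag{16} $$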


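Let us take $\nu$ fixed. Observe that in this case, $T_n(\mathbf{g_k}(X))$
depends only on the subsample $\mathcal{E}_l$, therefore the events
$(X_j,Y_j) \in A_\nu^i$ and $(X,Y) \in A_\nu^i$ are independent for all
$i = 0,1$, $j = k+1,\dots,n$. Then,

$$ \begin{aligned}
I &= \sum_{\nu \in C} P_{\mathcal{D}_k}\big(T_n(\mathbf{g_k}(X),0) > 1/2, (X,Y) \in A_\nu^0\big) \\
&= \sum_{\nu \in C} P_{\mathcal{D}_k}\left( \frac{\#\{j : (X_j,Y_j) \in A_\nu^1\}}{\#\{j : (X_j,Y_j) \in A_\nu\}} > \frac{1}{2},\ (X,Y) \in A_\nu^0 \right) \\
&= \sum_{\nu \in C} P_{\mathcal{D}_k}\Big( \#\{j : (X_j,Y_j) \in A_\nu^1\} > \#\{j : (X_j,Y_j) \in A_\nu^0\},\ (X,Y) \in A_\nu^0 \Big) \\
&= \sum_{\nu \in C} P_{\mathcal{D}_k}\left( \frac{1}{l}\sum_{j=k+1}^{n} \Big( \mathbb{I}_{\{(X_j,Y_j) \in A_\nu^1\}} - \mathbb{I}_{\{(X_j,Y_j) \in A_\nu^0\}} \Big) > 0 \right) P_{\mathcal{D}_k}\big((X,Y) \in A_\nu^0\big).
\end{aligned} $$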


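Let us define

$$ T_j^\nu = \mathbb{I}_{\{(X_j,Y_j) \in A_\nu^1\}} - \mathbb{I}_{\{(X_j,Y_j) \in A_\nu^0\}}, \qquad p_\nu^i = P_{\mathcal{D}_k}\big((X,Y) \in A_\nu^i\big), \quad i = 0,1, $$

$$ p_\nu = E(T_j^\nu) = p_\nu^1 - p_\nu^0, \qquad K_1 = 2\min_\nu p_\nu^2, \qquad \text{and} \qquad \sigma_\nu^2 = E\big((T_j^\nu)^2\big) = p_\nu^1 + p_\nu^0. $$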

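To bound term $I$, we will consider 3 cases, $p_\nu < 0$, $p_\nu > 0$ and
$p_\nu = 0$. Let us first assume that $p_\nu < 0$. In this case, using the
Hoeffding inequality we have,

$$ P_{\mathcal{D}_k}\left( \frac{1}{l}\sum_{j=k+1}^{n} T_j^\nu > 0 \right) = P_{\mathcal{D}_k}\left( \frac{1}{l}\sum_{j=k+1}^{n} (T_j^\nu - p_\nu) > -p_\nu \right) = \mathcal{O}_k(\exp(-K_1 l)). \tag{17} $$

If $p_\nu > 0$, using the Hoeffding inequality again we get

$$ P_{\mathcal{D}_k}\left( \frac{1}{l}\sum_{j=k+1}^{n} T_j^\nu > 0 \right) = 1 - P_{\mathcal{D}_k}\left( \frac{1}{l}\sum_{j=k+1}^{n} T_j^\nu \le 0 \right) = 1 - \mathcal{O}_k(\exp(-K_1 l)). \tag{18} $$

If $p_\nu = 0$, since for all $\nu$ and $j$, $E(|T_j^\nu|^3) \le 1$, using the
Berry-Esseen inequality we get

$$ P_{\mathcal{D}_k}\left( \frac{1}{l}\sum_{j=k+1}^{n} T_j^\nu > 0 \right) = \Big( \frac{1}{2} + \mathcal{O}_k(l^{-1/2}) \Big)\,\mathbb{I}_{\{\sigma_\nu^2 > 0\}}. \tag{19} $$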


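Observe that, since $P(Y = 1) = \sum_\nu p_\nu^1$ and
$P(Y = 0) = \sum_\nu p_\nu^0$, there exists $\nu$ such that $\sigma_\nu^2 > 0$.
Then, from (17), (18) and (19) we get

$$ \begin{aligned}
I &= \sum_{\nu \in C} P_{\mathcal{D}_k}\left( \frac{1}{l}\sum_{j=k+1}^{n} T_j^\nu > 0 \right) p_\nu^0 \\
&= \sum_{p_\nu < 0} P_{\mathcal{D}_k}\left( \frac{1}{l}\sum_{j=k+1}^{n} T_j^\nu > 0 \right) p_\nu^0 + \sum_{p_\nu > 0} P_{\mathcal{D}_k}\left( \frac{1}{l}\sum_{j=k+1}^{n} T_j^\nu > 0 \right) p_\nu^0 \\
&\quad + \sum_{p_\nu = 0} P_{\mathcal{D}_k}\left( \frac{1}{l}\sum_{j=k+1}^{n} T_j^\nu > 0 \right) p_\nu^0 \\
&= \mathcal{O}_k(\exp(-K_1 l))\,\mathbb{I}_{\{\exists \nu : p_\nu \neq 0\}} + \mathcal{O}_k(l^{-1/2})\,\mathbb{I}_{\{\exists \nu : (p_\nu = 0, \sigma_\nu^2 > 0)\}} \\
&\quad + \sum_{p_\nu > 0} p_\nu^0 + \frac{1}{2}\sum_{p_\nu = 0} p_\nu^0.
\end{aligned} \tag{20} $$

Analogously, it is easy to prove that

$$ \begin{aligned}
II &= \mathcal{O}_k(\exp(-K_1 l))\,\mathbb{I}_{\{\exists \nu : p_\nu \neq 0\}} + \mathcal{O}_k(l^{-1/2})\,\mathbb{I}_{\{\exists \nu : (p_\nu = 0, \sigma_\nu^2 > 0)\}} \\
&\quad + \sum_{p_\nu < 0} p_\nu^1 + \frac{1}{2}\sum_{p_\nu = 0} p_\nu^0,
\end{aligned} \tag{21} $$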
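where in the last term we have used that $p_\nu = 0$ implies
$p_\nu^0 = p_\nu^1$. Therefore, with (20) and (21) in (16) we get,

$$ \begin{aligned}
P_{\mathcal{D}_k}(g_T(X) \neq Y) &= \mathcal{O}_k(\exp(-K_1 l))\,\mathbb{I}_{\{\exists \nu : p_\nu \neq 0\}} + \mathcal{O}_k(l^{-1/2})\,\mathbb{I}_{\{\exists \nu : (p_\nu = 0, \sigma_\nu^2 > 0)\}} \\
&\quad + \sum_{p_\nu > 0} p_\nu^0 + \sum_{p_\nu < 0} p_\nu^1 + \sum_{p_\nu = 0} p_\nu^0 \\
&= \mathcal{O}_k(\exp(-K_1 l))\,\mathbb{I}_{\{\exists \nu : p_\nu \neq 0\}} + \mathcal{O}_k(l^{-1/2})\,\mathbb{I}_{\{\exists \nu : (p_\nu = 0, \sigma_\nu^2 > 0)\}} \\
&\quad + \sum_{\substack{\nu : \nu(m) = 0 \\ p_\nu > 0}} p_\nu^0 + \sum_{\substack{\nu : \nu(m) = 1 \\ p_\nu > 0}} p_\nu^0 + \sum_{\substack{\nu : \nu(m) = 0 \\ p_\nu < 0}} p_\nu^1 + \sum_{\substack{\nu : \nu(m) = 1 \\ p_\nu < 0}} p_\nu^1 + \sum_{p_\nu = 0} p_\nu^0.
\end{aligned} \tag{22} $$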


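On the other hand, for each $m$ we have,

$$ \begin{aligned}
P_{\mathcal{D}_k}(g_{mk}(X) \neq Y) &= P_{\mathcal{D}_k}(g_{mk}(X) = 0, Y = 1) + P_{\mathcal{D}_k}(g_{mk}(X) = 1, Y = 0) \\
&= P_{\mathcal{D}_k}\Big( (X,Y) \in \bigcup_{\nu : \nu(m) = 0} A_\nu^1 \Big) + P_{\mathcal{D}_k}\Big( (X,Y) \in \bigcup_{\nu : \nu(m) = 1} A_\nu^0 \Big) \\
&= \sum_{\nu : \nu(m) = 0} p_\nu^1 + \sum_{\nu : \nu(m) = 1} p_\nu^0 \\
&= \sum_{\substack{\nu : \nu(m) = 0 \\ p_\nu < 0}} p_\nu^1 + \sum_{\substack{\nu : \nu(m) = 0 \\ p_\nu > 0}} p_\nu^1 + \sum_{\substack{\nu : \nu(m) = 1 \\ p_\nu < 0}} p_\nu^0 + \sum_{\substack{\nu : \nu(m) = 1 \\ p_\nu > 0}} p_\nu^0 + \sum_{p_\nu = 0} p_\nu^0,
\end{aligned} \tag{23} $$

where in the last equality we used again that $p_\nu = 0$ implies
$p_\nu^0 = p_\nu^1$ to join

$$ \sum_{\substack{\nu : \nu(m) = 1 \\ p_\nu = 0}} p_\nu^0 + \sum_{\substack{\nu : \nu(m) = 0 \\ p_\nu = 0}} p_\nu^1 = \sum_{p_\nu = 0} p_\nu^0. $$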

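Therefore, from (22) and (23) we get

$$ \begin{aligned}
P_{\mathcal{D}_k}(g_T(X) \neq Y) - P_{\mathcal{D}_k}(g_{mk}(X) \neq Y) &= \mathcal{O}_k(\exp(-K_1 l))\,\mathbb{I}_{\{\exists \nu : p_\nu \neq 0\}} + \mathcal{O}_k(l^{-1/2})\,\mathbb{I}_{\{\exists \nu : (p_\nu = 0, \sigma_\nu^2 > 0)\}} \\
&\quad + \sum_{\substack{\nu : \nu(m) = 0 \\ p_\nu > 0}} (p_\nu^0 - p_\nu^1) + \sum_{\substack{\nu : \nu(m) = 1 \\ p_\nu < 0}} (p_\nu^1 - p_\nu^0) \\
&\le \mathcal{O}_k(\exp(-K_1 l))\,\mathbb{I}_{\{\exists \nu : p_\nu \neq 0\}} + \mathcal{O}_k(l^{-1/2})\,\mathbb{I}_{\{\exists \nu : (p_\nu = 0, \sigma_\nu^2 > 0)\}}.
\end{aligned} $$

Observe that, if $p_\nu \neq 0$ for all $\nu$, we get
$P_{\mathcal{D}_k}(g_T(X) \neq Y) - P_{\mathcal{D}_k}(g_{mk}(X) \neq Y) \le \mathcal{O}_k(\exp(-l K_1))$.

□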